Quality Assessment of Linked Datasets using Probabilistic Approximation
With the increasing application of Linked Open Data, assessing the quality of
datasets by computing quality metrics becomes an issue of crucial importance.
For large and evolving datasets, an exact, deterministic computation of the
quality metrics is too time consuming or expensive. We employ probabilistic
techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient
estimation for implementing a broad set of data quality metrics in an
approximate but sufficiently accurate way. Our implementation is integrated in
the comprehensive data quality assessment framework Luzzu. We evaluated its
performance and accuracy on Linked Open Datasets of broad relevance.
Comment: 15 pages, 2 figures. To appear in ESWC 2015 proceedings.
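To illustrate the flavor of these techniques, here is a minimal sketch of reservoir sampling in Python (my own illustration, not the Luzzu implementation): a metric evaluated over the reservoir approximates the metric over the full, possibly huge, dataset.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: keep a uniform random sample of k items from a stream
    of unknown length using O(k) memory; a quality metric computed on the
    reservoir then approximates the metric over the full dataset."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # replace a slot with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=1000)
```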
Cache-oblivious index for approximate string matching
This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies O((n log^k n)/B) disk pages and finds all k-error matches with O((|P| + occ)/B + log^k n · log log_B n) I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require Ω(|P| + occ + poly(log n)) I/Os. The second index reduces the space to O((n log n)/B) disk pages, and the I/O complexity is O((|P| + occ)/B + log^{k(k+1)} n · log log n).
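As a point of reference for the problem being indexed, the following sketch (an assumed baseline, not the paper's data structure) finds all k-error matches by plain dynamic programming; the indexes above exist to avoid this O(|P| · n) scan:

```python
def k_error_matches(text, pattern, k):
    """Sellers' dynamic programming over edit distance: report every end
    position j such that some substring of text ending at j is within
    edit distance k of pattern. O(|pattern| * |text|) time."""
    m = len(pattern)
    prev = list(range(m + 1))   # column before any text is read: D[i] = i
    matches = []
    for j, c in enumerate(text):
        curr = [0] * (m + 1)    # row 0 stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            curr[i] = min(prev[i - 1] + cost,  # match / substitution
                          prev[i] + 1,         # extra character in the text
                          curr[i - 1] + 1)     # pattern character unmatched
        if curr[m] <= k:
            matches.append(j)   # 0-based end position of a k-error match
        prev = curr
    return matches

# end positions of substrings within edit distance 2 of "celery"
print(k_error_matches("accelerate", "celery", 2))
```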
A Bulk-Parallel Priority Queue in External Memory with STXXL
We propose the design and an implementation of a bulk-parallel external
memory priority queue to take advantage of both shared-memory parallelism and
high external memory transfer speeds to parallel disks. To achieve higher
performance by decoupling item insertions and extractions, we offer two
parallelization interfaces: one using "bulk" sequences, the other by defining
"limit" items. In the design, we discuss how to parallelize insertions using
multiple heaps, and how to calculate a dynamic prediction sequence to prefetch
blocks and apply parallel multiway merge for extraction. Our experimental
results show that in the selected benchmarks the priority queue reaches 75% of
the full parallel I/O bandwidth of rotational disks and 65% of SSDs, or the
speed of sorting in external memory when bounded by computation.
Comment: extended version of the SEA'15 conference paper.
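The following toy sketch (a Python model of the idea, not the actual C++/STXXL implementation) illustrates the decoupling described above: bulk insertions are spread over several independent heaps, and extraction performs a multiway merge over their heads.

```python
import heapq

class BulkPriorityQueue:
    """Toy model of the bulk-parallel design: items go into several
    independent insertion heaps (in the real STXXL queue, one per thread),
    and extraction lazily merges across all heaps."""
    def __init__(self, num_heaps=4):
        self.heaps = [[] for _ in range(num_heaps)]

    def bulk_push(self, items):
        # Distribute a bulk of insertions round-robin over the heaps;
        # with real threads each heap would be filled concurrently.
        for i, item in enumerate(items):
            heapq.heappush(self.heaps[i % len(self.heaps)], item)

    def pop_min(self):
        # Multiway-merge step: take the smallest head among all heaps.
        best = min((h for h in self.heaps if h), key=lambda h: h[0], default=None)
        if best is None:
            raise IndexError("pop from empty queue")
        return heapq.heappop(best)

pq = BulkPriorityQueue()
pq.bulk_push([5, 1, 4, 1, 5, 9, 2, 6])
print([pq.pop_min() for _ in range(4)])  # -> [1, 1, 2, 4]
```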
Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
The longest common prefix (LCP) array is a versatile auxiliary data structure
in indexed string matching. It can be used to speed up searching using the
suffix array (SA) and provides an implicit representation of the topology of an
underlying suffix tree. The LCP array of a string of length n can be
represented as an array of n words, or, in the presence of the SA, as
a bit vector of 2n bits plus asymptotically negligible support data
structures. External memory construction algorithms for the LCP array have been
proposed, but those proposed so far have a space requirement of O(n) words
(i.e. O(n log n) bits) in external memory. This space requirement is in some
practical cases prohibitively expensive. We present an external memory
algorithm for constructing the 2n bit version of the LCP array which uses
O(n log σ) bits of additional space in external memory when given a
(compressed) BWT with alphabet size σ and a sampled inverse suffix array
at sampling rate O(log n). This is often a significant space gain in
practice where σ is usually much smaller than n or even constant. We
also consider the case of computing succinct LCP arrays for circular strings.
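To make the 2n-bit representation concrete, here is a small sketch (my own illustration, not the paper's construction algorithm) of the standard succinct encoding of the permuted LCP (PLCP) array: because PLCP[i] >= PLCP[i-1] - 1, the deltas PLCP[i] - PLCP[i-1] + 1 are non-negative, and unary-coding them yields a bit vector of roughly 2n bits from which each value can be recovered by a select query.

```python
def plcp_to_bits(plcp):
    """Unary-code the deltas PLCP[i] - PLCP[i-1] + 1 (each >= 0, summing
    to at most about n), giving a bit vector of roughly 2n bits."""
    bits, prev = [], 0
    for v in plcp:
        bits.extend([0] * (v - prev + 1))  # delta + 1 zeros ...
        bits.append(1)                     # ... then a terminator bit
        prev = v
    return bits

def plcp_from_bits(bits, i):
    """Recover PLCP[i] = select1(i+1) - 2(i+1); a real implementation would
    use an asymptotically negligible select structure, not a linear scan."""
    ones = 0
    for pos, b in enumerate(bits, start=1):
        ones += b
        if ones == i + 1:
            return pos - 2 * (i + 1)

bits = plcp_to_bits([0, 2, 1])             # -> [0, 1, 0, 0, 0, 1, 1]
print([plcp_from_bits(bits, i) for i in range(3)])  # -> [0, 2, 1]
```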
Fading histograms in detecting distribution and concept changes
The remarkable number of real applications under dynamic scenarios is
driving a novel ability to generate and gather information. Nowadays, a
massive amount of information is generated at a high-speed rate, known as
data streams. Moreover, data are collected under evolving environments.
Due to memory restrictions, data must be promptly processed and discarded
immediately. Therefore, dealing with evolving data streams raises two main
questions: (i) how to remember discarded data? and (ii) how to forget
outdated data? To maintain an updated representation of the time-evolving
data, this paper proposes fading histograms. To cope with the dynamics of
nature, changes in data are detected through a windowing scheme that
compares data distributions computed from the fading histograms: the
adaptive cumulative windows model (ACWM). The online monitoring of the
distance between data distributions is evaluated using a dissimilarity
measure based on the asymmetry of the Kullback–Leibler divergence. The
experimental results support the ability of fading histograms to provide
an updated representation of data. This property works in favor of
detecting distribution changes with a smaller detection delay when
compared with standard histograms. With respect to the detection of
concept changes, the ACWM is compared with three known algorithms from the
literature, on artificial data and on public data sets, presenting better
results. Furthermore, the proposed method was extended to multidimensional
data, and the experiments performed show the ability of the ACWM to detect
distribution changes in these settings.
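A minimal sketch of the idea follows (the fading factor alpha and the bin layout are assumptions for illustration; the paper's dissimilarity exploits the asymmetry of the KL divergence, for which a symmetrised KL is used here as a simple stand-in):

```python
import math

class FadingHistogram:
    """Histogram over fixed bins whose counts decay by a fading factor
    alpha (0 < alpha < 1) at every update, so old data is gradually
    forgotten instead of being dropped abruptly."""
    def __init__(self, edges, alpha=0.997):
        self.edges, self.alpha = sorted(edges), alpha
        self.counts = [0.0] * (len(edges) + 1)

    def update(self, x):
        self.counts = [c * self.alpha for c in self.counts]  # fade old data
        i = sum(e <= x for e in self.edges)                  # bin index of x
        self.counts[i] += 1.0

    def probs(self, eps=1e-9):
        # Smoothed normalized counts, so KL terms stay finite.
        total = sum(self.counts) + eps * len(self.counts)
        return [(c + eps) / total for c in self.counts]

def symmetric_kl(p, q):
    """Symmetrised Kullback-Leibler divergence between two histograms."""
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))
```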
Refinement type contracts for verification of scientific investigative software
Our scientific knowledge is increasingly built on software output. User code
which defines data analysis pipelines and computational models is essential for
research in the natural and social sciences, but little is known about how to
ensure its correctness. The structure of this code and the development process
used to build it limit the utility of traditional testing methodology. Formal
methods for software verification have seen great success in ensuring code
correctness but generally require more specialized training, development time,
and funding than is available in the natural and social sciences. Here, we
present a Python library which uses lightweight formal methods to provide
correctness guarantees without the need for specialized knowledge or
substantial time investment. Our package provides runtime verification of
function entry and exit condition contracts using refinement types. It allows
checking hyperproperties within contracts and offers automated test case
generation to supplement online checking. We co-developed our tool with a
medium-sized (3000 LOC) software package which simulates
decision-making in cognitive neuroscience. In addition to helping us locate
trivial bugs earlier on in the development cycle, our tool was able to locate
four bugs which may have been difficult to find using traditional testing
methods. It was also able to find bugs in user code which did not contain
contracts or refinement type annotations. This demonstrates how formal methods
can be used to verify the correctness of scientific software which is difficult
to test with mainstream approaches.
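The sketch below shows roughly what runtime-checked entry/exit contracts can look like in Python; the decorator name and predicate style are invented for illustration and are not the library's actual API.

```python
import functools

def contract(pre=None, post=None):
    """Check refinement-style conditions at runtime: `pre` receives the
    call's keyword arguments, `post` receives the returned value."""
    def decorate(f):
        @functools.wraps(f)
        def wrapper(**kwargs):
            if pre is not None:
                assert pre(**kwargs), f"entry contract of {f.__name__} violated"
            result = f(**kwargs)
            if post is not None:
                assert post(result), f"exit contract of {f.__name__} violated"
            return result
        return wrapper
    return decorate

# e.g. a "probability" is a float refined by 0 <= p <= 1
@contract(pre=lambda p, n: 0.0 <= p <= 1.0 and n > 0,
          post=lambda r: r >= 0.0)
def expected_successes(p, n):
    return p * n

expected_successes(p=0.3, n=10)   # passes; p=1.5 would fail at entry
```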
Hybrid Statistical Estimation of Mutual Information for Quantifying Information Flow
Analysis of a probabilistic system often requires learning the joint probability distribution of its random variables. Computing the exact distribution usually amounts to an exhaustive, precise analysis of all executions of the system. To avoid the high computational cost of such an exhaustive search, statistical analysis has been studied to efficiently obtain approximate estimates by analyzing only a small but representative subset of the system's behavior. In this paper we propose a hybrid statistical estimation method that combines precise and statistical analyses to estimate mutual information and its confidence interval. We show how to combine analyses of different components of the system, carried out at different levels of precision, to obtain an estimate for the whole system. The new method performs weighted statistical analysis with different sample sizes over different components and dynamically finds their optimal sample sizes. Moreover, it can reduce sample sizes by using prior knowledge about systems and a new abstraction-then-sampling technique based on qualitative analysis. We show that the new method outperforms the state of the art in quantifying information leakage.
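As a sketch of the statistical half of such a hybrid (illustrative only, not the paper's estimator), the following computes a plug-in estimate of mutual information from sampled executions; a hybrid method would replace sampling with exact probabilities on components that admit cheap precise analysis.

```python
import math
from collections import Counter

def estimate_mutual_information(samples):
    """Plug-in estimate of I(X;Y) in bits from (x, y) pairs obtained by
    running the system, using the empirical joint distribution."""
    n = len(samples)
    joint = Counter(samples)                 # counts of (x, y) pairs
    px = Counter(x for x, _ in samples)      # marginal counts of x
    py = Counter(y for _, y in samples)      # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) == c * n / (px[x] * py[y])
        mi += p_xy * math.log2(c * n / (px[x] * py[y]))
    return mi

# e.g. a channel observed through sampled (secret, output) executions
print(estimate_mutual_information([(0, 0), (0, 0), (1, 1), (1, 0)]))
```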